Natural language processing (NLP), also known as computational linguistics.
Extracting meaning from text algorithmically.
Computers are good at processing text but not at understanding it; humans, conversely, are good at understanding text but not at processing it.
Artists:
1) Manuel Medrano. Songs: Bajo el Agua, Una y otra vez, La Mujer Que Bota Fuego, Si Pudiera, La Distancia
2) Andres Cepeda. Songs: Lo Mejor Que Hay en mi Vida, Mejor que a ti me va, Besos Usados, Tengo Ganas
3) Morat. Songs: Como te Atreves, Besos en Guerra, Aprender a Quererte, A donde Vamos
4) Cara Luna. Songs: Mi Primer Millon, Tabaco Chanel, Pasos Gigantes, Perderme contigo
https://www.letras.com/manuel-medrano/bajo-el-agua/
https://www.letras.com/manuel-medrano/una-y-otra-vez/
https://www.letras.com/manuel-medrano/la-mujer-que-bota-fuego/
https://www.letras.com/manuel-medrano/si-pudiera/
https://www.letras.com/manuel-medrano/la-distancia/
https://www.letras.com/andres-cepeda/lo-mejor-que-hay-en-mi-vida/
https://www.letras.com/andres-cepeda/mejor-que-a-ti-me-va/
https://www.letras.com/andres-cepeda/1792354/
https://www.letras.com/andres-cepeda/266343/
https://www.letras.com/morat/como-te-atreves/
https://www.letras.com/morat/besos-en-guerra/
https://www.letras.com/morat/aprender-a-quererte/
https://www.letras.com/morat/a-donde-vamos/
https://www.letras.com/bacilos/65340/
https://www.letras.com/bacilos/65341/
https://www.letras.com/bacilos/65342/
https://www.letras.com/bacilos/pasos-de-gigantes/
https://www.letras.com/bacilos/perderme-contigo/
Artists:
1) Reykon. Songs: El Lider, El Chisme Remix, Tu Cuerpo Me Llama Remix, El Error, La Santa, Ginza Remix, Secretos, Domingo, Imaginandote
2) J Balvin. Songs: Ay Vamos, 6 AM, Rojo, Culpables, Safari, Mi Gente, No Es Justo, Blanco
3) Manuel Turizo. Songs: Una Lady Como Tu, Esclavos de tus Besos, La Bachata
4) Maluma. Songs: Hawai, Borro Cassette, El Perdedor, Addicted, Carnaval, Cosas Pendientes
https://www.youtube.com/watch?v=3xinCpjWxxU
https://www.youtube.com/watch?v=jgQ2MSwgC6A
https://www.youtube.com/watch?v=C3jp2lid58g
https://www.youtube.com/watch?v=u5KFYnfKgWo
https://www.youtube.com/watch?v=m8JoSkGVsFA
https://www.youtube.com/watch?v=TapXs54Ah3E
https://www.youtube.com/watch?v=yUV9JwiQLog&pp=ygUDNmFt
https://www.youtube.com/watch?v=_tG70FWd1Ds&pp=ygUEcm9qbw%3D%3D
https://www.youtube.com/watch?v=VYtJAuoZxcc&pp=ygUQdW5hIGxhZHkgY29tbyB0dQ%3D%3D
https://www.youtube.com/watch?v=1afoVNPPQCI&pp=ygUUZXNjbGF2byBkZSB0dXMgYmVzb3M%3D
https://www.youtube.com/watch?v=TiM_TFpT_DE&pp=ygUKbGEgYmFjaGF0YQ%3D%3D
https://www.youtube.com/watch?v=ZFwpzIz8eWE&pp=ygUIc2VjcmV0b3M%3D
https://www.youtube.com/watch?v=f7uFHxg6nks&pp=ygUMaW1hZ2luYW5kb3Rl
https://www.youtube.com/watch?v=KIvhiN0WHfY&pp=ygUHZG9taW5nbw%3D%3D
https://www.youtube.com/watch?v=JWESLtAKKlU&pp=ygUGc2FmYXJp
https://www.youtube.com/watch?v=wnJ6LuUFpMo&pp=ygUIbWkgZ2VudGU%3D
https://www.youtube.com/watch?v=2zn4dAuZ2RU&pp=ygULbm8gZXMganVzdG8%3D
https://www.youtube.com/watch?v=8j1xiiAZhIQ&pp=ygUGYmxhbmNv
https://www.youtube.com/watch?v=pK060iUFWXg&pp=ygUGaGF3YWlp
https://www.youtube.com/watch?v=Xk0wdDTTPA0&pp=ygUVYm9ycm8gY2Fzc2V0dGUgbWFsdW1h0gcJCY0JAYcqIYzv
https://www.youtube.com/watch?v=PJniSb91tvo&pp=ygULZWwgcGVyZGVkb3LSBwkJjQkBhyohjO8%3D
https://www.youtube.com/watch?v=pMIHC_cItd4&pp=ygUPYWRkaWN0ZWQgbWFsdW1h
https://www.youtube.com/watch?v=ufa0K9w9z2c&pp=ygUPY2FybmF2YWwgbWFsdW1h
https://www.youtube.com/watch?v=6vPhcRew8hA&pp=ygUQY29zYXMgcGVuZGllbnRlcw%3D%3D
Artists:
1) Los De Adentro. Songs: Nubes Negras, Quiero Amarte, No Mas, Tal Vez
2) Kraken. Songs: Fragil al Viento, Vestido de Cristal, America, Silencioso Amor
3) Aterciopelados. Songs: Baracunata, Florecita Rockera
4) Caifanes. Songs: Afuera, Viento, No dejes que
5) Enanitos Verdes. Songs: La Muralla Verde
https://youtu.be/8_Tc5uP8SL4?si=DvK_wDQjGbTRIebT
https://youtu.be/8hGAklEil10?si=G5-nZzi1n0rdJprE
https://youtu.be/s09hOXaPhJ8?si=YLYijSYH1bMkCWnS
https://youtu.be/HqiX6-f5w-s?si=Ar1XWmjQP62TSuJX
https://youtu.be/1tVF5rpmFM4?si=N9ItbxZCNnGU8tBm
https://youtu.be/I4YtarQbE7U?si=GbTw8n7ih3JVFAyN
https://youtu.be/Pcy_F40W9EM?si=_DXgEk08DYAgLU2F
https://youtu.be/Q3ReRsnYG4I?si=mmNq8bDT1ONyJpL1
https://youtu.be/mqOCHYhRaGY?si=hKE0l0zyVzcQQr8V
https://youtu.be/ARR3gkzX8I0?si=JCz3UO1Uzjz6FijW
https://youtu.be/DNbG5IIA71w?si=v1-D2Rjm-HVVkx_a
https://youtu.be/9KIshSBiojI?si=gZcaeeLWVnf73mxa
https://youtu.be/i17Go6G-siA?si=E5JJA9aPvYxDtITo
https://youtu.be/tYGZ1YCD2YU?si=vIcjZ9h9ZK69-omp
Artists:
1) Joe Arroyo. Songs: Rebelión, En Barranquilla me quedo, Tal para cual, Pa'l bailador, Te quiero más
2) Fruko y sus Tesos. Songs: El Preso, Los Charcos, Cachondea, El Ausente, El Son del Tren
3) Grupo Niche. Songs: Sin sentimiento, Algo que se quede, Cali pachanguero, Se pareció tanto a ti, Una aventura
4) Yuri Buenaventura. Songs: No Estoy contigo, ¿Dónde Estás?, Salsa, Tu Cancion, El Guerrero
https://www.youtube.com/watch?v=7HtWEPfQJxw
https://www.youtube.com/watch?v=ms3VDvksgks
https://www.youtube.com/watch?v=I3eXeVRqtHQ
https://www.youtube.com/watch?v=o0Bn_qVzZvE
https://www.youtube.com/watch?v=NlemaAlPeZs
##### import data
suppressMessages(suppressWarnings(library(readr)))
suppressMessages(suppressWarnings(library(tidyverse)))
# warnings are due to non-UTF-8 characters or empty strings ("")
# UTF-8 (8-bit Unicode Transformation Format) is a character encoding format
# capable of encoding all valid Unicode code points
text_Baladas <- read_csv("baladas.txt", col_names = FALSE, show_col_types = FALSE)
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
class(text_Baladas)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
text_Baladas <- c(text_Baladas)
class(text_Baladas)
## [1] "list"
text_Baladas <- unlist(text_Baladas)
class(text_Baladas)
## [1] "character"
names(text_Baladas) <- NULL # important: drop element names
head(text_Baladas, n = 3)
## [1] "Quiero volar contigo" "Muy alto en algún lugar"
## [3] "Quisiera estar contigo"
# Reggaeton
text_Reggaeton <- unlist(c(read_csv("Reggaeton_proyecto.txt", col_names = FALSE, show_col_types = FALSE)))
names(text_Reggaeton) <- NULL
# Rock
text_Rock_canciones <- unlist(c(read_csv("Rock_canciones.txt", col_names = FALSE, show_col_types = FALSE)))
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
names(text_Rock_canciones) <- NULL
# Salsa
text_Salsa_canciones <- unlist(c(read_csv("Salsa_canciones.txt", col_names = FALSE, show_col_types = FALSE)))
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
names(text_Salsa_canciones) <- NULL
##### data frame in tidy format
# Baladas
text_Baladas <- tibble(line = 1:length(text_Baladas), text = text_Baladas) # tibble instead of the deprecated data_frame
class(text_Baladas)
## [1] "tbl_df" "tbl" "data.frame"
dim(text_Baladas)
## [1] 957 2
head(text_Baladas, n = 3)
## # A tibble: 3 × 2
## line text
## <int> <chr>
## 1 1 Quiero volar contigo
## 2 2 Muy alto en algún lugar
## 3 3 Quisiera estar contigo
# text is not yet normalized
# it has no "structure" to analyze
# Reggaeton
text_Reggaeton<- tibble(line = 1:length(text_Reggaeton), text = text_Reggaeton)
# Rock canciones
text_Rock_canciones<- tibble(line = 1:length(text_Rock_canciones), text = text_Rock_canciones)
# Salsa canciones
text_Salsa_canciones<- tibble(line = 1:length(text_Salsa_canciones), text = text_Salsa_canciones)
Store the text in a structured format.
Token: the unit of analysis.
In basic tokenization each token is a single word.
One-token-per-row format.
By default, punctuation is removed and the text is lowercased (accents are not removed by default).
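These defaults can be checked on a toy string before touching the corpus (a minimal sketch; the example sentence is invented for illustration):

```r
# minimal sketch of unnest_tokens defaults (toy sentence, not from the corpus)
library(dplyr)
library(tibble)
library(tidytext)

toy <- tibble(line = 1L, text = "¡Quiero VOLAR, contigo... a algún lugar!")
toy %>%
  unnest_tokens(input = text, output = word)
# punctuation (including ¡ and ...) is stripped and words are lowercased,
# but the accent in "algún" is kept
```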
suppressMessages(suppressWarnings(library(tidytext)))
suppressMessages(suppressWarnings(library(magrittr)))
##### tokenization in tidy format
# ---------- Baladas ----------
text_Baladas %<>%
unnest_tokens(input = text, output = word) %>%
filter(!is.na(word)) # important: drop NA tokens
class(text_Baladas)
## [1] "tbl_df" "tbl" "data.frame"
dim(text_Baladas)
## [1] 6071 2
head(text_Baladas, n = 10)
## # A tibble: 10 × 2
## line word
## <int> <chr>
## 1 1 quiero
## 2 1 volar
## 3 1 contigo
## 4 2 muy
## 5 2 alto
## 6 2 en
## 7 2 algún
## 8 2 lugar
## 9 3 quisiera
## 10 3 estar
# ---------- Reggaeton ----------
text_Reggaeton %<>%
unnest_tokens(input = text, output = word) %>%
filter(!is.na(word))
dim(text_Reggaeton)
## [1] 4432 2
head(text_Reggaeton, n = 10)
## # A tibble: 10 × 2
## line word
## <int> <chr>
## 1 1 el
## 2 1 chisme
## 3 1 remix
## 4 2 ayo
## 5 3 the
## 6 3 official
## 7 3 remix
## 8 3 baby
## 9 4 me
## 10 4 duele
# -----------Rock_canciones-------
text_Rock_canciones%<>%
unnest_tokens(input = text, output = word) %>%
filter(!is.na(word))
dim(text_Rock_canciones)
## [1] 2844 2
head(text_Rock_canciones, n = 10)
## # A tibble: 10 × 2
## line word
## <int> <chr>
## 1 1 los
## 2 1 de
## 3 1 adentro
## 4 1 nubes
## 5 1 negras
## 6 2 ti
## 7 2 movería
## 8 2 cielo
## 9 2 y
## 10 2 tierra
# -----------Salsa_canciones-------
text_Salsa_canciones%<>%
unnest_tokens(input = text, output = word) %>%
filter(!is.na(word))
dim(text_Salsa_canciones)
## [1] 5472 2
head(text_Salsa_canciones, n = 10)
## # A tibble: 10 × 2
## line word
## <int> <chr>
## 1 1 a
## 2 1 joe
## 3 1 arrollo
## 4 2 canciones
## 5 3 i
## 6 3 rebelion
## 7 4 quiero
## 8 4 contarle
## 9 4 mi
## 10 4 hermano
Remove: tokens containing digits, stop words, and accents.
##### tokens containing numbers?
# ---------- Baladas ----------
text_Baladas %>%
filter(grepl(pattern = '[0-9]', x = word)) %>%
count(word, sort = TRUE)
## # A tibble: 1 × 2
## word n
## <chr> <int>
## 1 29 1
# ---------- Reggaeton ----------
text_Reggaeton %>%
filter(grepl(pattern = '[0-9]', x = word)) %>%
count(word, sort = TRUE)
## # A tibble: 2 × 2
## word n
## <chr> <int>
## 1 6 5
## 2 440 1
# ------------Rock_canciones-----------
text_Rock_canciones %>%
filter(grepl(pattern = '[0-9]', x = word)) %>%
count(word, sort = TRUE)
## # A tibble: 0 × 2
## # ℹ 2 variables: word <chr>, n <int>
# ------------Salsa_canciones-----------
text_Salsa_canciones %>%
filter(grepl(pattern = '[0-9]', x = word)) %>%
count(word, sort = TRUE)
## # A tibble: 0 × 2
## # ℹ 2 variables: word <chr>, n <int>
##### remove tokens containing numbers
# ---------- Baladas ----------
text_Baladas %<>%
filter(!grepl(pattern = '[0-9]', x = word))
dim(text_Baladas)
## [1] 6070 2
# ---------- Reggaeton ----------
text_Reggaeton %<>%
filter(!grepl(pattern = '[0-9]', x = word))
dim(text_Reggaeton)
## [1] 4426 2
# -----------Rock_canciones-----
text_Rock_canciones %<>%
filter(!grepl(pattern = '[0-9]', x = word))
dim(text_Rock_canciones)
## [1] 2844 2
# -----------Salsa_canciones-----
text_Salsa_canciones %<>%
filter(!grepl(pattern = '[0-9]', x = word))
dim(text_Salsa_canciones)
## [1] 5472 2
##### stop words
# 3 English lexicons (onix, SMART, snowball) included by default in tidytext
data(stop_words)
class(stop_words)
## [1] "tbl_df" "tbl" "data.frame"
dim(stop_words)
## [1] 1149 2
head(stop_words, n = 10)
## # A tibble: 10 × 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
table(stop_words$lexicon)
##
## onix SMART snowball
## 404 571 174
###### stop words
# no Spanish lexicons are available in tidytext
# COUNTWORDSFREE Spanish lexicon (with accents)
# http://countwordsfree.com/stopwords/spanish
# other alternatives:
# https://github.com/stopwords-iso/stopwords-es
# tm::stopwords("spanish")
# the same format as the tidytext lexicons is kept
stop_words_es <- tibble(word = unlist(c(read.table("Stopwords.txt", quote="\"", comment.char=""))), lexicon = "custom")
dim(stop_words_es)
## [1] 102 2
head(stop_words_es, n = 10)
## # A tibble: 10 × 2
## word lexicon
## <chr> <chr>
## 1 La custom
## 2 lo custom
## 3 las custom
## 4 un custom
## 5 una custom
## 6 de custom
## 7 en custom
## 8 con custom
## 9 por custom
## 10 para custom
##### remove stop words
# ---------- Baladas ----------
text_Baladas %<>%
anti_join(x = ., y = stop_words_es)
## Joining with `by = join_by(word)`
dim(text_Baladas)
## [1] 2720 2
head(text_Baladas, n = 10)
## # A tibble: 10 × 2
## line word
## <int> <chr>
## 1 1 quiero
## 2 1 volar
## 3 1 contigo
## 4 2 muy
## 5 2 alto
## 6 2 algún
## 7 2 lugar
## 8 3 quisiera
## 9 3 contigo
## 10 4 viendo
# ---------- Reggaeton ----------
text_Reggaeton %<>%
anti_join(x = ., y = stop_words_es)
## Joining with `by = join_by(word)`
dim(text_Reggaeton)
## [1] 2047 2
head(text_Reggaeton, n = 10)
## # A tibble: 10 × 2
## line word
## <int> <chr>
## 1 1 chisme
## 2 2 ayo
## 3 3 the
## 4 3 official
## 5 3 baby
## 6 4 duele
## 7 4 haberte
## 8 4 entregado
## 9 4 amor
## 10 4 puro
#-----------Rock_canciones------
text_Rock_canciones %<>%
anti_join(x = . , y = stop_words_es)
## Joining with `by = join_by(word)`
dim(text_Rock_canciones)
## [1] 1489 2
head(text_Rock_canciones, n = 10)
## # A tibble: 10 × 2
## line word
## <int> <chr>
## 1 1 adentro
## 2 1 nubes
## 3 1 negras
## 4 2 movería
## 5 2 cielo
## 6 2 tierra
## 7 2 pudiera
## 8 3 cuanto
## 9 3 daría
## 10 3 tenerte
#-----------Salsa_canciones------
text_Salsa_canciones %<>%
anti_join(x = . , y = stop_words_es)
## Joining with `by = join_by(word)`
dim(text_Salsa_canciones)
## [1] 2657 2
head(text_Salsa_canciones, n = 10)
## # A tibble: 10 × 2
## line word
## <int> <chr>
## 1 1 joe
## 2 1 arrollo
## 3 2 canciones
## 4 3 i
## 5 3 rebelion
## 6 4 quiero
## 7 4 contarle
## 8 4 hermano
## 9 5 pedacito
## 10 5 historia
##### remove accents
replacement_list <- list('á' = 'a', 'é' = 'e', 'í' = 'i', 'ó' = 'o', 'ú' = 'u')
# ---------- Baladas ----------
text_Baladas %<>%
mutate(word = chartr(old = names(replacement_list) %>% str_c(collapse = ''),
new = replacement_list %>% str_c(collapse = ''),
x = word))
dim(text_Baladas)
## [1] 2720 2
head(text_Baladas, n = 10)
## # A tibble: 10 × 2
## line word
## <int> <chr>
## 1 1 quiero
## 2 1 volar
## 3 1 contigo
## 4 2 muy
## 5 2 alto
## 6 2 algun
## 7 2 lugar
## 8 3 quisiera
## 9 3 contigo
## 10 4 viendo
# ---------- Reggaeton ----------
text_Reggaeton %<>%
mutate(word = chartr(old = names(replacement_list) %>% str_c(collapse = ''),
new = replacement_list %>% str_c(collapse = ''),
x = word))
dim(text_Reggaeton)
## [1] 2047 2
head(text_Reggaeton, n = 10)
## # A tibble: 10 × 2
## line word
## <int> <chr>
## 1 1 chisme
## 2 2 ayo
## 3 3 the
## 4 3 official
## 5 3 baby
## 6 4 duele
## 7 4 haberte
## 8 4 entregado
## 9 4 amor
## 10 4 puro
#-----------------Rock_canciones-------------
text_Rock_canciones %<>%
mutate(word = chartr(old = names(replacement_list)%>% str_c(collapse = ''),
new = replacement_list %>% str_c(collapse = ''),
x = word))
dim(text_Rock_canciones)
## [1] 1489 2
head(text_Rock_canciones, n= 10)
## # A tibble: 10 × 2
## line word
## <int> <chr>
## 1 1 adentro
## 2 1 nubes
## 3 1 negras
## 4 2 moveria
## 5 2 cielo
## 6 2 tierra
## 7 2 pudiera
## 8 3 cuanto
## 9 3 daria
## 10 3 tenerte
#-----------------Salsa_canciones-------------
text_Salsa_canciones %<>%
mutate(word = chartr(old = names(replacement_list)%>% str_c(collapse = ''),
new = replacement_list %>% str_c(collapse = ''),
x = word))
dim(text_Salsa_canciones)
## [1] 2657 2
head(text_Salsa_canciones, n= 10)
## # A tibble: 10 × 2
## line word
## <int> <chr>
## 1 1 joe
## 2 1 arrollo
## 3 2 canciones
## 4 3 i
## 5 3 rebelion
## 6 4 quiero
## 7 4 contarle
## 8 4 hermano
## 9 5 pedacito
## 10 5 historia
##### top 10 most frequent tokens
# ---------- Baladas ----------
text_Baladas%>%
count(word, sort = TRUE) %>%
head(n = 10)
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 quiero 51
## 2 solo 36
## 3 vida 36
## 4 contigo 29
## 5 amor 23
## 6 besos 23
## 7 otra 23
## 8 se 21
## 9 como 20
## 10 mujer 19
# ---------- Reggaeton ----------
text_Reggaeton%>%
count(word, sort = TRUE) %>%
head(n = 10)
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 quiero 30
## 2 cama 19
## 3 ganas 19
## 4 recuerdo 18
## 5 santa 18
## 6 encanta 17
## 7 amor 16
## 8 baby 15
## 9 solo 15
## 10 dale 14
#------------Rock_canciones-------------
text_Rock_canciones%>%
count(word, sort = TRUE) %>%
head(n = 10)
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 solo 33
## 2 quiero 32
## 3 amor 26
## 4 afuera 16
## 5 tuyas 14
## 6 corazon 13
## 7 negras 13
## 8 nubes 13
## 9 que 13
## 10 adentro 12
#------------Salsa_canciones-------------
text_Salsa_canciones%>%
count(word, sort = TRUE) %>%
head(n = 10)
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 amor 53
## 2 son 52
## 3 mas 46
## 4 quiero 41
## 5 salsa 30
## 6 quedo 27
## 7 vida 27
## 8 mundo 25
## 9 negra 24
## 10 jamas 23
##### viz
suppressMessages(suppressWarnings(library(gridExtra)))
# ---------- Baladas ----------
text_Baladas %>%
count(word, sort = TRUE) %>%
filter(n > 14) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
theme_light() +
geom_col(fill = 'red4', alpha = 0.8) +
xlab(NULL) +
ylab("Frecuencia") +
coord_flip() +
ggtitle(label = 'Baladas: Conteo de palabras') -> p1
# ---------- Reggaeton ----------
text_Reggaeton %>%
count(word, sort = TRUE) %>%
filter(n > 10) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
theme_light() +
geom_col(fill = 'blue4', alpha = 0.8) +
xlab(NULL) +
ylab("Frecuencia") +
coord_flip() +
ggtitle(label = 'Reggaeton: Conteo de palabras') -> p2
#-----------Rock_canciones--------
text_Rock_canciones %>%
count(word, sort = TRUE) %>%
filter(n > 10) %>%
mutate(word = reorder(word,n)) %>%
ggplot(aes(x = word, y = n)) +
theme_light()+
geom_col(fill= 'purple4', alpha = 0.8)+
xlab(NULL)+
ylab("Frecuencia")+
coord_flip()+
ggtitle(label = 'Rock: Conteo de palabras') -> p3
#-----------Salsa_canciones--------
text_Salsa_canciones %>%
count(word, sort = TRUE) %>%
filter(n > 15) %>%
mutate(word = reorder(word,n)) %>%
ggplot(aes(x = word, y = n)) +
theme_light()+
geom_col(fill= 'yellow4', alpha = 0.8)+
xlab(NULL)+
ylab("Frecuencia")+
coord_flip()+
ggtitle(label = 'Salsa: Conteo de palabras') -> p4
# display plots
grid.arrange(p1, p2, p3, p4)
suppressMessages(suppressWarnings(library(wordcloud)))
###### viz
par(mfrow = c(2,2), mar = c(1,1,2,1), mgp = c(1,1,1))
# ---------- Baladas ----------
set.seed(123)
text_Baladas %>%
count(word, sort = TRUE) %>%
with(wordcloud(words = word, freq = n, max.words = 12, colors = 'red4'))
title(main = "Baladas: Nube de Palabras")
# ---------- Reggaeton ----------
set.seed(123)
text_Reggaeton %>%
count(word, sort = TRUE) %>%
with(wordcloud(words = word, freq = n, max.words = 10, colors = 'blue4'))
## Warning in wordcloud(words = word, freq = n, max.words = 10, colors = "blue4"):
## quiero could not be fit on page. It will not be plotted.
title(main = "Reggaeton: Nube de Palabras")
#-------------Rock_canciones-----------
set.seed(123)
text_Rock_canciones %>%
count(word, sort = TRUE) %>%
with(wordcloud(words = word, freq = n, max.words = 10, colors = 'purple4'))
title(main = "Rock: Nube de Palabras")
#-------------Salsa_canciones-----------
set.seed(1234)
text_Salsa_canciones %>%
count(word, sort = TRUE) %>%
with(wordcloud(words = word, freq = n, max.words = 10, colors = 'yellow4'))
## Warning in wordcloud(words = word, freq = n, max.words = 10, colors =
## "yellow4"): quiero could not be fit on page. It will not be plotted.
title(main = "Salsa: Nube de Palabras")
##### relative word frequencies
bind_rows(mutate(.data = text_Baladas, author = "Baladas"),
mutate(.data = text_Reggaeton, author = "Reggaeton"),
mutate(.data = text_Rock_canciones, author = "Rock_canciones"),
mutate(.data = text_Salsa_canciones, author = "Salsa_canciones")) %>%
count(author, word) %>%
group_by(author) %>%
mutate(proportion = n/sum(n)) %>%
select(-n) %>%
spread(author, proportion, fill = 0) -> frec # important: wide format; spread() is superseded by tidyr::pivot_wider()
frec %<>%
select(word, Baladas, Reggaeton, Rock_canciones, Salsa_canciones)
dim(frec)
## [1] 2502 5
head(frec, n = 10)
## # A tibble: 10 × 5
## word Baladas Reggaeton Rock_canciones Salsa_canciones
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 abandonados 0 0 0 0.000376
## 2 abandones 0 0 0.000672 0
## 3 abatidas 0 0 0 0.000376
## 4 abeja 0 0 0.00134 0
## 5 abrazame 0 0 0 0.000376
## 6 abrazarte 0.000368 0.000489 0 0
## 7 abrazo 0.000368 0 0 0.000376
## 8 abriendo 0 0 0.000672 0
## 9 abrigarte 0 0.000489 0 0
## 10 aburrida 0.000368 0 0 0
##### top 10 words in common
# nested ordering by Baladas, Reggaeton, Rock, and Salsa
frec %>%
filter(Baladas != 0, Reggaeton != 0, Rock_canciones != 0) %>% # note: Salsa_canciones is not filtered here
arrange(desc(Baladas), desc(Reggaeton), desc(Rock_canciones), desc(Salsa_canciones)) -> frec_comun
dim(frec_comun)
## [1] 100 5
head(frec_comun, n = 10)
## # A tibble: 10 × 5
## word Baladas Reggaeton Rock_canciones Salsa_canciones
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 quiero 0.0188 0.0147 0.0215 0.0154
## 2 solo 0.0132 0.00733 0.0222 0.00715
## 3 vida 0.0132 0.00489 0.00269 0.0102
## 4 contigo 0.0107 0.00489 0.00201 0.00151
## 5 amor 0.00846 0.00782 0.0175 0.0199
## 6 besos 0.00846 0.00244 0.00134 0.000753
## 7 mujer 0.00699 0.00195 0.00134 0.000753
## 8 nada 0.00662 0.00440 0.00403 0.00188
## 9 siempre 0.00625 0.00293 0.00269 0.00414
## 10 nunca 0.00588 0.00244 0.000672 0.00226
###### proportion of words in common
dim(frec_comun)[1]/dim(frec)[1]
## [1] 0.03996803
##### correlation of the frequencies
# beware of the assumptions of the test
# the bootstrap is a possible alternative
# note: each call below correlates a variable with itself, so cor = 1 is trivial;
# to compare genres use, e.g., cor.test(frec$Baladas, frec$Reggaeton)
cor.test(x = frec$Baladas, y = frec$Baladas)
##
## Pearson's product-moment correlation
##
## data: frec$Baladas and frec$Baladas
## t = Inf, df = 2500, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 1 1
## sample estimates:
## cor
## 1
cor.test(x = frec_comun$Reggaeton, y = frec_comun$Reggaeton)
##
## Pearson's product-moment correlation
##
## data: frec_comun$Reggaeton and frec_comun$Reggaeton
## t = Inf, df = 98, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 1 1
## sample estimates:
## cor
## 1
cor.test(x = frec_comun$Rock_canciones, y = frec_comun$Rock_canciones)
##
## Pearson's product-moment correlation
##
## data: frec_comun$Rock_canciones and frec_comun$Rock_canciones
## t = 664343859, df = 98, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 1 1
## sample estimates:
## cor
## 1
cor.test(x = frec_comun$Salsa_canciones, y = frec_comun$Salsa_canciones)
##
## Pearson's product-moment correlation
##
## data: frec_comun$Salsa_canciones and frec_comun$Salsa_canciones
## t = Inf, df = 98, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 1 1
## sample estimates:
## cor
## 1
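As noted at the top of this block, the bootstrap is an alternative when the assumptions behind Pearson's test are in doubt. A minimal sketch, resampling rows of frec_comun (B = 1000 is an arbitrary choice, and the pair of genres compared is purely illustrative):

```r
# percentile bootstrap interval for the correlation between two genres
set.seed(123)
B <- 1000
n <- nrow(frec_comun)
boot_cor <- replicate(B, {
  idx <- sample.int(n, size = n, replace = TRUE)  # resample word rows
  cor(frec_comun$Baladas[idx], frec_comun$Reggaeton[idx])
})
quantile(boot_cor, probs = c(0.025, 0.975))  # approximate 95% interval
```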
Words (simple tokens or unigrams) are assigned a score (a scale, positive/negative, an emotion).
The sentiment of a text is defined as the sum of the scores of its individual words.
Dictionaries:
Objectives:
Caveats:
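The word-score summation just described can be sketched directly (a toy example; the lexicon, words, and scores are invented for illustration):

```r
# toy unigram sentiment: total score = sum of per-word scores
library(dplyr)
library(tibble)

toy_lexicon <- tibble(word  = c("amor", "feliz", "triste", "dolor"),
                      score = c(2, 3, -2, -3))
toy_tokens <- tibble(word = c("amor", "triste", "amor", "lugar"))

toy_tokens %>%
  inner_join(toy_lexicon, by = "word") %>%  # words absent from the lexicon contribute nothing
  summarise(sentiment = sum(score))
# 2 + (-2) + 2 = 2
```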
##### sentiments
# 3 English lexicons (AFINN, Bing, NRC) included by default in tidytext
# AFINN: Finn Arup Nielsen, scale from -5 to 5.
# http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010
# Bing: Bing Liu and collaborators, binary classification (+/-).
# https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
# NRC: Saif Mohammad and Peter Turney, binary classification (+/-) plus some emotion categories.
# http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
# lexicons
# no Spanish lexicons are available in tidytext
# https://www.kaggle.com/datasets/rtatman/sentiment-lexicons-for-81-languages
positive_words <- read_csv("Positive_words.txt", col_names = "word", show_col_types = FALSE) %>%
mutate(sentiment = "Positivo")
negative_words <- read_csv("Negativewords.txt", col_names = "word", show_col_types = FALSE) %>%
mutate(sentiment = "Negativo")
sentiment_words <- bind_rows(positive_words, negative_words)
# lexicon comparison
get_sentiments("bing") %>%
count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
sentiment_words %>%
count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 Negativo 95
## 2 Positivo 71
###### viz
suppressMessages(suppressWarnings(library(RColorBrewer)))
# ---------- Baladas ----------
text_Baladas %>%
inner_join(sentiment_words) %>%
count(word, sentiment, sort = TRUE) %>%
filter(n > 2) %>%
mutate(n = ifelse(sentiment == "Negativo", -n, n)) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col() +
scale_fill_manual(values = brewer.pal(8,'Dark2')[c(2,5)]) +
coord_flip(ylim = c(-7,7)) +
labs(y = "Frecuencia",
x = NULL,
title = "Baladas") +
theme_minimal() +
theme(legend.position = "none") -> p1
## Joining with `by = join_by(word)`
## Warning in inner_join(., sentiment_words): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 263 of `x` matches multiple rows in `y`.
## ℹ Row 2 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
# ---------- Reggaeton ----------
text_Reggaeton %>%
inner_join(sentiment_words) %>%
count(word, sentiment, sort = TRUE) %>%
filter(n > 2) %>%
mutate(n = ifelse(sentiment == "Negativo", -n, n)) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col() +
scale_fill_manual(values = brewer.pal(8,'Dark2')[c(2,5)]) +
coord_flip(ylim = c(-7,7)) +
labs(y = "Frecuencia",
x = NULL,
title = "Reggaeton") +
theme_minimal() +
theme(legend.position = "none") -> p2
## Joining with `by = join_by(word)`
## Warning in inner_join(., sentiment_words): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1504 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
#-------------Rock_canciones------------------
text_Rock_canciones %>%
inner_join(sentiment_words) %>%
count(word, sentiment, sort = TRUE) %>%
filter(n > 2) %>%
mutate(n = ifelse(sentiment == "Negativo", -n, n)) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col() +
scale_fill_manual(values = brewer.pal(8,'Dark2')[c(2,5)]) +
coord_flip(ylim = c(-7,7)) +
labs(y = "Frecuencia",
x = NULL,
title = "Rock") +
theme_minimal() +
theme(legend.position = "none") -> p3
## Joining with `by = join_by(word)`
## Warning in inner_join(., sentiment_words): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 31 of `x` matches multiple rows in `y`.
## ℹ Row 72 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
#-------------Salsa_canciones------------------
text_Salsa_canciones %>%
inner_join(sentiment_words) %>%
count(word, sentiment, sort = TRUE) %>%
filter(n > 2) %>%
mutate(n = ifelse(sentiment == "Negativo", -n, n)) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col() +
scale_fill_manual(values = brewer.pal(8,'Dark2')[c(2,5)]) +
coord_flip(ylim = c(-7,7)) +
labs(y = "Frecuencia",
x = NULL,
title = "Salsa",
fill = "Sentimiento"
) +
theme_minimal() -> p4
## Joining with `by = join_by(word)`
## Warning in inner_join(., sentiment_words): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435 of `x` matches multiple rows in `y`.
## ℹ Row 71 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
# display plots
grid.arrange(p1, p2, p3, p4, ncol = 4)
suppressMessages(suppressWarnings(library(reshape2))) # acast
##### viz
par(mfrow = c(2,2), mar = c(1,1,3,1), mgp = c(1,1,1))
# ---------- Baladas ----------
set.seed(123)
text_Baladas %>%
inner_join(sentiment_words) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(
colors = c("darkred", "darkgreen"),
title.size = 0.01,
title.colors = c("white", "white"),
family = "serif",
scale = c(3,1),
max.words = 50
)
## Joining with `by = join_by(word)`
## Warning in inner_join(., sentiment_words): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 263 of `x` matches multiple rows in `y`.
## ℹ Row 2 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
title(main = "Baladas: NB Sentimiento")
# ---------- Reggaeton ----------
set.seed(123)
text_Reggaeton %>%
inner_join(sentiment_words) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(
colors = c("red", "green"),
title.size = 1.5,
family = "serif",
scale = c(3,1),
max.words = 50
)
## Joining with `by = join_by(word)`
## Warning in inner_join(., sentiment_words): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1504 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
title(main = "Reggaeton: NB Sentimiento")
#--------Rock_canciones---------
set.seed(123)
text_Rock_canciones %>%
inner_join(sentiment_words) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(
colors = c("red", "green"),
title.size = 1.5,
family = "serif",
scale = c(3,1),
max.words = 50
)
## Joining with `by = join_by(word)`
## Warning in inner_join(., sentiment_words): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 31 of `x` matches multiple rows in `y`.
## ℹ Row 72 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
title(main = "Rock: NB Sentimiento")
#--------Salsa_canciones---------
set.seed(123)
text_Salsa_canciones %>%
inner_join(sentiment_words) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(
colors = c("darkred", "darkgreen"),
title.size = 1.5,
family = "serif",
scale = c(3,1),
max.words = 50
)
## Joining with `by = join_by(word)`
## Warning in inner_join(., sentiment_words): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435 of `x` matches multiple rows in `y`.
## ℹ Row 71 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
title(main = "Salsa: NB Sentimiento")
So far unnest_tokens has been used to tokenize into individual words.
Now we want to tokenize into sequences of words.
Which words tend to follow others? Which words tend to co-occur?
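The second question (co-occurrence rather than adjacency) can be sketched with the widyr package, counting word pairs that appear on the same line (an assumption: widyr is not loaded elsewhere in this document):

```r
# sketch: count words co-occurring on the same line (requires widyr)
library(widyr)

text_Baladas %>%
  pairwise_count(item = word, feature = line, sort = TRUE) %>%
  head(n = 10)
```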
##### bigrams: example song from Rock_canciones
# text
text <- c("Préstame tu peine",
"Y péiname el alma",
"Desenrédame",
"Fuera de este mundo",
"Dime que no estoy",
"Soñándote",
"Enséñame",
"De qué estamos hechos",
"Que quiero orbitar planetas",
"Hasta ver uno vació",
"Que quiero irme a vivir",
"Pero que sea contigo",
"Viento",
"Amárranos",
"Tiempo",
"Detente muchos años",
"Viento",
"Amárranos",
"Tiempo",
"Detente muchos años")
# convert to data frame
text_df <- tibble(line = 1:length(text), text = text)
# tokenize into bigrams
text_df %>%
unnest_tokens(tbl = ., input = text, output = bigram, token = "ngrams", n = 2) %>%
head(n = 10)
## # A tibble: 10 × 2
## line bigram
## <int> <chr>
## 1 1 préstame tu
## 2 1 tu peine
## 3 2 y péiname
## 4 2 péiname el
## 5 2 el alma
## 6 3 <NA>
## 7 4 fuera de
## 8 4 de este
## 9 4 este mundo
## 10 5 dime que
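To make explicit what `unnest_tokens(..., token = "ngrams", n = 2)` is doing above, here is a minimal base-R sketch (the function name `make_bigrams` is ours, not part of tidytext; the real tokenizer also handles punctuation and casing more robustly):

```r
# Sketch of bigram tokenization: lowercase, split on whitespace,
# then pair each word with its successor. One-word lines yield NA,
# matching the <NA> rows in the output above.
make_bigrams <- function(line) {
  words <- strsplit(tolower(line), "\\s+")[[1]]
  if (length(words) < 2) return(NA_character_)
  paste(head(words, -1), tail(words, -1))
}
make_bigrams("Préstame tu peine")
# returns c("préstame tu", "tu peine")
make_bigrams("Desenrédame")
# returns NA
```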
##### bigramas: Ejemplo cancion Baladas
# texto
text <- c("Que si pudiera darle vueltas a la Tierra una y otra vez",
"Yo buscaría de alguien con tus mismos ojos, con tus mismos labios",
"Con tu misma boca y con tu misma piel",
"Que si pudiera darle al tiempo otro poco de tiempo",
"Para comprender que sin ti, mi vida ya no la siento",
"Que el color se vuelve a blanco y negro",
"Y sé que la distancia me hizo ciego",
"En todos los momentos",
"Los que tenía que verte aquí, Que si pudiera darle vueltas a la Tierra una y otra vez",
"Yo buscaría de alguien con tus mismos ojos, con tus mismos labios",
"Con tu misma boca y con tu misma piel",
"Que si pudiera darle al tiempo otro poco de tiempo",
"Para comprender que sin ti, mi vida ya no la siento",
"Que el color se vuelve a blanco y negro",
"Y sé que la distancia me hizo ciego",
"En todos los momentos",
"Los que tenía que verte aquí")
#convertir a data frame
text_df <- tibble(line = 1:length(text), text = text)
# tokenizar en bigramas
text_df %>%
unnest_tokens(tbl = ., input = text, output = bigram, token = "ngrams", n = 2) %>%
head(n = 10)
## # A tibble: 10 × 2
## line bigram
## <int> <chr>
## 1 1 que si
## 2 1 si pudiera
## 3 1 pudiera darle
## 4 1 darle vueltas
## 5 1 vueltas a
## 6 1 a la
## 7 1 la tierra
## 8 1 tierra una
## 9 1 una y
## 10 1 y otra
##### bigramas: Ejemplo Reggaeton
# Texto
text <- c("No hay que sufrir, no hay que llorar",
"La vida es una y es un carnaval",
"Lo malo se irá, todo pasará",
"La vida es una y es un carnaval",
"La vida es una y es un carnaval",
"La vida es una y es un carnaval",
"Seré tu ángel guardián",
"Tu mejor compañía",
"Toma fuerte mi mano",
"Te enseñaré a volar",
"Ya no habrá mal de amores",
"Vendrán tiempos mejores",
"Levanta ya tu mano que vinimos a gozar")
#convertir a data frame
text_df <- tibble(line = 1:length(text), text = text)
# tokenizar en bigramas
text_df %>%
unnest_tokens(tbl = ., input = text, output = bigram, token = "ngrams", n = 2) %>%
head(n = 10)
## # A tibble: 10 × 2
## line bigram
## <int> <chr>
## 1 1 no hay
## 2 1 hay que
## 3 1 que sufrir
## 4 1 sufrir no
## 5 1 no hay
## 6 1 hay que
## 7 1 que llorar
## 8 2 la vida
## 9 2 vida es
## 10 2 es una
##### bigramas: Ejemplo Salsa
# Texto
text <- c("Me siento en el techo y empiezo a ordenar para ti",
"Los besos que no pude dibujar",
"Sale el sol, me acaricia",
"Nace tu canción",
"La gente como tú, enciende mi ser",
"Perdóname si molesto, estas cosas que hago yo",
"Ya ves, ya he olvidado o si son verdes o son azul",
"Y te diré, tus ojos viven en mi",
"Y son los más tiernos de la mayoría")
#convertir a data frame
text_df <- tibble(line = 1:length(text), text = text)
# tokenizar en bigramas
text_df %>%
unnest_tokens(tbl = ., input = text, output = bigram, token = "ngrams", n = 2) %>%
head(n = 10)
## # A tibble: 10 × 2
## line bigram
## <int> <chr>
## 1 1 me siento
## 2 1 siento en
## 3 1 en el
## 4 1 el techo
## 5 1 techo y
## 6 1 y empiezo
## 7 1 empiezo a
## 8 1 a ordenar
## 9 1 ordenar para
## 10 1 para ti
##### importar datos
text_Rock_canciones <- unlist(c(read_csv("Rock_canciones.txt", col_names = FALSE, show_col_types = FALSE)))
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
names(text_Rock_canciones) <- NULL
text_Rock_canciones <- tibble(line = 1:length(text_Rock_canciones), text = text_Rock_canciones)
##### importar datos
text_Baladas <- unlist(c(read_csv("Baladas.txt", col_names = FALSE, show_col_types = FALSE)))
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
names(text_Baladas) <- NULL
text_Baladas<- tibble(line = 1:length(text_Baladas), text = text_Baladas)
##### importar datos
text_Reggaeton <- unlist(c(read_csv("Reggaeton_proyecto.txt", col_names = FALSE, show_col_types = FALSE)))
names(text_Reggaeton) <- NULL
text_Reggaeton <- tibble(line = 1:length(text_Reggaeton), text = text_Reggaeton)
##### importar datos
text_Salsa_canciones <- unlist(c(read_csv("Salsa_canciones.txt", col_names = FALSE, show_col_types = FALSE)))
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
names(text_Salsa_canciones) <- NULL
text_Salsa_canciones <- tibble(line = 1:length(text_Salsa_canciones), text = text_Salsa_canciones)
##### tokenizar en bigramas
# en este caso cada token es un bigrama
text_Rock_canciones %>%
unnest_tokens(tbl = ., input = text, output = bigram, token = "ngrams", n = 2) %>%
filter(!is.na(bigram)) -> text_Rock_canciones_bi # importante!
dim(text_Rock_canciones_bi)
## [1] 2210 2
head(text_Rock_canciones_bi, n = 10)
## # A tibble: 10 × 2
## line bigram
## <int> <chr>
## 1 1 los de
## 2 1 de adentro
## 3 1 adentro nubes
## 4 1 nubes negras
## 5 2 ti movería
## 6 2 movería cielo
## 7 2 cielo y
## 8 2 y tierra
## 9 2 tierra si
## 10 2 si pudiera
##### tokenizar en bigramas
# en este caso cada token es un bigrama
text_Baladas %>%
unnest_tokens(tbl = ., input = text, output = bigram, token = "ngrams", n = 2) %>%
filter(!is.na(bigram)) -> text_Baladas_bi # importante!
dim(text_Baladas_bi)
## [1] 5114 2
head(text_Baladas_bi, n = 10)
## # A tibble: 10 × 2
## line bigram
## <int> <chr>
## 1 1 quiero volar
## 2 1 volar contigo
## 3 2 muy alto
## 4 2 alto en
## 5 2 en algún
## 6 2 algún lugar
## 7 3 quisiera estar
## 8 3 estar contigo
## 9 4 viendo las
## 10 4 las estrellas
##### tokenizar en bigramas
# en este caso cada token es un bigrama
text_Reggaeton %>%
unnest_tokens(tbl = ., input = text, output = bigram, token = "ngrams", n = 2) %>%
filter(!is.na(bigram)) -> text_Reggaeton_bi # importante!
dim(text_Reggaeton_bi)
## [1] 3693 2
head(text_Reggaeton_bi, n = 10)
## # A tibble: 10 × 2
## line bigram
## <int> <chr>
## 1 1 el chisme
## 2 1 chisme remix
## 3 3 the official
## 4 3 official remix
## 5 3 remix baby
## 6 4 me duele
## 7 4 duele haberte
## 8 4 haberte entregado
## 9 4 entregado un
## 10 4 un amor
##### tokenizar en bigramas
# en este caso cada token es un bigrama
text_Salsa_canciones %>%
unnest_tokens(tbl = ., input = text, output = bigram, token = "ngrams", n = 2) %>%
filter(!is.na(bigram)) -> text_Salsa_canciones_bi # importante!
dim(text_Salsa_canciones_bi)
## [1] 4517 2
head(text_Salsa_canciones_bi, n = 10)
## # A tibble: 10 × 2
## line bigram
## <int> <chr>
## 1 1 a joe
## 2 1 joe arrollo
## 3 3 i rebelion
## 4 4 quiero contarle
## 5 4 contarle mi
## 6 4 mi hermano
## 7 5 un pedacito
## 8 5 pedacito de
## 9 5 de la
## 10 5 la historia
###### top 10 de bigramas mas frecuentes
# hay bigramas que no son interesantes (e.g., "de la")
# esto motiva el uso de stop words nuevamente
text_Rock_canciones_bi %>%
count(bigram, sort = TRUE) %>%
head(n = 10)
## # A tibble: 10 × 2
## bigram n
## <chr> <int>
## 1 de mi 16
## 2 lo que 15
## 3 hoy quiero 14
## 4 oh oh 14
## 5 nubes negras 13
## 6 el album 12
## 7 mi cabeza 12
## 8 mi corazón 12
## 9 album de 11
## 10 cabeza sólo 11
###### top 10 de bigramas mas frecuentes
# hay bigramas que no son interesantes (e.g., "de la")
# esto motiva el uso de stop words nuevamente
text_Baladas_bi %>%
count(bigram, sort = TRUE) %>%
head(n = 10)
## # A tibble: 10 × 2
## bigram n
## <chr> <int>
## 1 oh oh 120
## 2 mmh mmh 49
## 3 de ti 34
## 4 no se 33
## 5 que no 31
## 6 que te 28
## 7 en la 24
## 8 mi vida 20
## 9 oh ooh 20
## 10 no es 19
###### top 10 de bigramas mas frecuentes
# hay bigramas que no son interesantes (e.g., "de la")
# esto motiva el uso de stop words nuevamente
text_Reggaeton_bi %>%
count(bigram, sort = TRUE) %>%
head(n = 10)
## # A tibble: 10 × 2
## bigram n
## <chr> <int>
## 1 en mi 30
## 2 no te 27
## 3 lo que 25
## 4 que no 25
## 5 no no 24
## 6 que te 23
## 7 que me 22
## 8 se que 20
## 9 la santa 18
## 10 mi cama 18
###### top 10 de bigramas mas frecuentes
# hay bigramas que no son interesantes (e.g., "de la")
# esto motiva el uso de stop words nuevamente
text_Salsa_canciones_bi %>%
count(bigram, sort = TRUE) %>%
head(n = 10)
## # A tibble: 10 × 2
## bigram n
## <chr> <int>
## 1 no no 36
## 2 me quedo 27
## 3 a la 26
## 4 el mundo 24
## 5 que te 24
## 6 la vida 23
## 7 que no 22
## 8 le pegue 21
## 9 no le 21
## 10 pegue a 21
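`count(bigram, sort = TRUE)` is essentially a descending frequency table; a base-R equivalent (on hypothetical toy data, not the corpus) makes the counting step transparent:

```r
# Frequency count of bigrams, most frequent first (toy example)
bigrams <- c("no hay", "hay que", "no hay", "la vida", "no hay")
freq <- sort(table(bigrams), decreasing = TRUE)
head(freq, n = 2)
# "no hay" appears 3 times and comes first; the ties follow
```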
##### omitir stop words
text_Rock_canciones_bi %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!grepl(pattern = '[0-9]', x = word1)) %>%
filter(!grepl(pattern = '[0-9]', x = word2)) %>%
filter(!word1 %in% stop_words_es$word) %>%
filter(!word2 %in% stop_words_es$word) %>%
mutate(word1 = chartr(old = names(replacement_list) %>% str_c(collapse = ''),
new = replacement_list %>% str_c(collapse = ''),
x = word1)) %>%
mutate(word2 = chartr(old = names(replacement_list) %>% str_c(collapse = ''),
new = replacement_list %>% str_c(collapse = ''),
x = word2)) %>%
filter(!is.na(word1)) %>%
filter(!is.na(word2)) %>%
count(word1, word2, sort = TRUE) %>%
rename(weight = n) -> text_Rock_canciones_bi_counts # importante para la conformacion de la red!
dim(text_Rock_canciones_bi_counts)
## [1] 213 3
head(text_Rock_canciones_bi_counts, n = 10)
## # A tibble: 10 × 3
## word1 word2 weight
## <chr> <chr> <int>
## 1 nubes negras 13
## 2 cabeza solo 11
## 3 fotos tuyas 11
## 4 florecita rockera 9
## 5 solo adentro 9
## 6 quiero entender 8
## 7 solo quiero 8
## 8 diablo amor 7
## 9 muchos años 6
## 10 negras sobre 6
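The `chartr()` calls above rely on `replacement_list`, defined earlier in the document as a named mapping from accented to plain vowels (assumed here to be á→a, é→e, í→i, ó→o, ú→u). A minimal self-contained check of that step, using base `paste()` in place of `str_c()`:

```r
# Rebuild the old/new character strings and strip accents with chartr()
replacement_list <- list('á' = 'a', 'é' = 'e', 'í' = 'i', 'ó' = 'o', 'ú' = 'u')
old <- paste(names(replacement_list), collapse = '')
new <- paste(unlist(replacement_list), collapse = '')
chartr(old, new, c("corazón", "sólo", "años"))
# "corazón" -> "corazon", "sólo" -> "solo";
# "años" keeps the ñ unless it is added to the list
```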
##### omitir stop words
text_Baladas_bi %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!grepl(pattern = '[0-9]', x = word1)) %>%
filter(!grepl(pattern = '[0-9]', x = word2)) %>%
filter(!word1 %in% stop_words_es$word) %>%
filter(!word2 %in% stop_words_es$word) %>%
mutate(word1 = chartr(old = names(replacement_list) %>% str_c(collapse = ''),
new = replacement_list %>% str_c(collapse = ''),
x = word1)) %>%
mutate(word2 = chartr(old = names(replacement_list) %>% str_c(collapse = ''),
new = replacement_list %>% str_c(collapse = ''),
x = word2)) %>%
filter(!is.na(word1)) %>%
filter(!is.na(word2)) %>%
count(word1, word2, sort = TRUE) %>%
rename(weight = n) -> text_Baladas_bi_counts # importante para la conformacion de la red!
dim(text_Baladas_bi_counts)
## [1] 396 3
head(text_Baladas_bi_counts, n = 10)
## # A tibble: 10 × 3
## word1 word2 weight
## <chr> <chr> <int>
## 1 solo quiero 16
## 2 besos matan 12
## 3 donde vamos 7
## 4 nadie ve 7
## 5 primer millon 7
## 6 bota fuego 6
## 7 matan morire 6
## 8 perderme contigo 6
## 9 pudiera darle 6
## 10 puede prohibir 6
##### omitir stop words
text_Reggaeton_bi %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!grepl(pattern = '[0-9]', x = word1)) %>%
filter(!grepl(pattern = '[0-9]', x = word2)) %>%
filter(!word1 %in% stop_words_es$word) %>%
filter(!word2 %in% stop_words_es$word) %>%
mutate(word1 = chartr(old = names(replacement_list) %>% str_c(collapse = ''),
new = replacement_list %>% str_c(collapse = ''),
x = word1)) %>%
mutate(word2 = chartr(old = names(replacement_list) %>% str_c(collapse = ''),
new = replacement_list %>% str_c(collapse = ''),
x = word2)) %>%
filter(!is.na(word1)) %>%
filter(!is.na(word2)) %>%
count(word1, word2, sort = TRUE) %>%
rename(weight = n) -> text_Reggaeton_bi_counts # importante para la conformacion de la red!
dim(text_Reggaeton_bi_counts)
## [1] 376 3
head(text_Reggaeton_bi_counts, n = 10)
## # A tibble: 10 × 3
## word1 word2 weight
## <chr> <chr> <int>
## 1 sigue bailando 8
## 2 bailando mami 6
## 3 cogi anoche 6
## 4 levanta baby 6
## 5 misma hora 6
## 6 necesita reggaeton 6
## 7 pantalon dale 6
## 8 quiero tenerte 6
## 9 reggaeton dale 6
## 10 rozamo algo 6
##### omitir stop words
text_Salsa_canciones_bi %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!grepl(pattern = '[0-9]', x = word1)) %>%
filter(!grepl(pattern = '[0-9]', x = word2)) %>%
filter(!word1 %in% stop_words_es$word) %>%
filter(!word2 %in% stop_words_es$word) %>%
mutate(word1 = chartr(old = names(replacement_list) %>% str_c(collapse = ''),
new = replacement_list %>% str_c(collapse = ''),
x = word1)) %>%
mutate(word2 = chartr(old = names(replacement_list) %>% str_c(collapse = ''),
new = replacement_list %>% str_c(collapse = ''),
x = word2)) %>%
filter(!is.na(word1)) %>%
filter(!is.na(word2)) %>%
count(word1, word2, sort = TRUE) %>%
rename(weight = n) -> text_Salsa_canciones_bi_counts # importante para la conformacion de la red!
dim(text_Salsa_canciones_bi_counts)
## [1] 470 3
head(text_Salsa_canciones_bi_counts, n = 10)
## # A tibble: 10 × 3
## word1 word2 weight
## <chr> <chr> <int>
## 1 quiero mas 17
## 2 negado amor 12
## 3 vida dura 12
## 4 jamas jamas 11
## 5 mas bonita 9
## 6 otro pasito 8
## 7 mas ni 7
## 8 cachondeas vagabundo 6
## 9 ese men 6
## 10 mundo quiere 6
##### definir una red a partir de la frecuencia (weight) de los bigramas
# binaria, no dirigida, ponderada, simple
# se recomienda variar el umbral del filtro y construir bigramas no consecutivos para obtener redes con mayor informacion
suppressMessages(suppressWarnings(library(igraph)))
g <- text_Rock_canciones_bi_counts %>%
filter(weight > 2) %>%
graph_from_data_frame(directed = FALSE)
# viz
set.seed(123)
plot(g, layout = layout_with_fr, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label.color = 'black', vertex.label.cex = 1, vertex.label.dist = 1, main = "Bigramas Rock con Umbral = 3")
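Note that `filter(weight > 2)` keeps only bigrams observed at least 3 times, i.e. a threshold (umbral) of 3. As the comment recommends, it is worth varying this value; a toy count of surviving edges (the weights below are illustrative, not from the corpus) shows how quickly the network thins out:

```r
# Edges with weight > u, for thresholds u = 0..3 (toy bigram counts)
bi_counts <- data.frame(weight = c(13, 11, 9, 6, 3, 2, 1, 1))
sapply(0:3, function(u) sum(bi_counts$weight > u))
# returns c(8L, 6L, 5L, 4L): each unit increase in u prunes more edges
```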
##### definir una red a partir de la frecuencia (weight) de los bigramas
# binaria, no dirigida, ponderada, simple
# se recomienda variar el umbral del filtro y construir bigramas no consecutivos para obtener redes con mayor informacion
suppressMessages(suppressWarnings(library(igraph)))
g <- text_Baladas_bi_counts %>%
filter(weight > 2) %>%
graph_from_data_frame(directed = FALSE)
# viz
set.seed(123)
plot(g, layout = layout_with_fr, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label.color = 'purple', vertex.label.cex = 1, vertex.label.dist = 1, main = "Bigramas Baladas con Umbral = 3")
##### definir una red a partir de la frecuencia (weight) de los bigramas
# binaria, no dirigida, ponderada, simple
# se recomienda variar el umbral del filtro y construir bigramas no consecutivos para obtener redes con mayor informacion
suppressMessages(suppressWarnings(library(igraph)))
g <- text_Reggaeton_bi_counts %>%
filter(weight > 2) %>%
graph_from_data_frame(directed = FALSE)
# viz
set.seed(123)
plot(g, layout = layout_with_fr, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label.color = 'maroon', vertex.label.cex = 1, vertex.label.dist = 1, main = "Bigramas Reggaeton con Umbral = 3")
##### definir una red a partir de la frecuencia (weight) de los bigramas
# binaria, no dirigida, ponderada, simple
# se recomienda variar el umbral del filtro y construir bigramas no consecutivos para obtener redes con mayor informacion
suppressMessages(suppressWarnings(library(igraph)))
g <- text_Salsa_canciones_bi_counts %>%
filter(weight > 2) %>%
graph_from_data_frame(directed = FALSE)
# viz
set.seed(123)
plot(g, layout = layout_with_fr, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label.color = 'navyblue', vertex.label.cex = 1, vertex.label.dist = 1, main = "Bigramas Salsa con Umbral = 3")
##### red con un umbral diferente
g <- text_Rock_canciones_bi_counts %>%
filter(weight > 0) %>%
graph_from_data_frame(directed = FALSE)
# viz
set.seed(123)
plot(g, layout = layout_with_kk, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label = NA, main = "Umbral = 1")
##### red con un umbral diferente
g <- text_Baladas_bi_counts %>%
filter(weight > 0) %>%
graph_from_data_frame(directed = FALSE)
# viz
set.seed(123)
plot(g, layout = layout_with_kk, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label = NA, main = "Umbral = 1")
##### red con un umbral diferente
g <- text_Reggaeton_bi_counts %>%
filter(weight > 0) %>%
graph_from_data_frame(directed = FALSE)
# viz
set.seed(123)
plot(g, layout = layout_with_kk, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label = NA, main = "Umbral = 1")
##### red con un umbral diferente
g <- text_Salsa_canciones_bi_counts %>%
filter(weight > 0) %>%
graph_from_data_frame(directed = FALSE)
# viz
set.seed(123)
plot(g, layout = layout_with_kk, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label = NA, main = "Umbral = 1")
##### componente conexa mas grande de la red
g <- text_Rock_canciones_bi_counts %>%
filter(weight > 0) %>%
graph_from_data_frame(directed = FALSE)
# grafo inducido por la componente conexa
V(g)$cluster <- components(graph = g)$membership
gcc <- induced_subgraph(graph = g, vids = which(V(g)$cluster == which.max(components(graph = g)$csize)))
par(mfrow = c(1,2), mar = c(1,1,2,1), mgp = c(1,1,1))
# viz 1
set.seed(123)
plot(gcc, layout = layout_with_kk, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label.color = 'black', vertex.label.cex = 0.9, vertex.label.dist = 1)
# viz 2
set.seed(123)
plot(gcc, layout = layout_with_kk, vertex.color = adjustcolor('red4', 0.1), vertex.frame.color = 'red4', vertex.size = 2*strength(gcc), vertex.label.color = 'black', vertex.label.cex = 0.9, vertex.label.dist = 1, edge.width = 3*E(g)$weight/max(E(g)$weight))
title(main = "Componente conexa Rock Canciones", outer = T, line = -1)
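`components()` (the current name for the deprecated `clusters()`) labels each vertex with the id of its connected component, and `induced_subgraph()` keeps the vertices of the largest one. The same idea can be sketched in base R without igraph (the function and toy edge list are ours, for illustration):

```r
# Base-R sketch of the largest connected component of an undirected
# edge list: propagate the minimum component label until stable.
largest_component <- function(edges) {
  verts <- unique(c(edges$from, edges$to))
  comp <- setNames(seq_along(verts), verts)   # each vertex starts alone
  repeat {
    changed <- FALSE
    for (i in seq_len(nrow(edges))) {
      a <- edges$from[i]; b <- edges$to[i]
      m <- min(comp[a], comp[b])
      if (comp[a] != m || comp[b] != m) { comp[a] <- m; comp[b] <- m; changed <- TRUE }
    }
    if (!changed) break
  }
  names(comp)[comp == as.integer(names(which.max(table(comp))))]
}
edges <- data.frame(from = c("nubes", "negras", "solo", "aislada"),
                    to   = c("negras", "sobre", "quiero", "palabra"))
largest_component(edges)
# returns c("nubes", "negras", "sobre")
```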
##### componente conexa mas grande de la red
g <- text_Baladas_bi_counts %>%
filter(weight > 0) %>%
graph_from_data_frame(directed = FALSE)
# grafo inducido por la componente conexa
V(g)$cluster <- components(graph = g)$membership
gcc <- induced_subgraph(graph = g, vids = which(V(g)$cluster == which.max(components(graph = g)$csize)))
par(mfrow = c(1,2), mar = c(1,1,2,1), mgp = c(1,1,1))
# viz 1
set.seed(123)
plot(gcc, layout = layout_with_kk, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label.color = 'black', vertex.label.cex = 0.6, vertex.label.dist = 2)
# viz 2
set.seed(123)
plot(gcc, layout = layout_with_kk, vertex.color = adjustcolor('green4', 0.1), vertex.frame.color = 'green4', vertex.size = 2*strength(gcc), vertex.label.color = 'black', vertex.label.cex = 0.9, vertex.label.dist = 1, edge.width = 3*E(g)$weight/max(E(g)$weight))
title(main = "Componente conexa Baladas", outer = T, line = -1)
##### componente conexa mas grande de la red
g <- text_Reggaeton_bi_counts %>%
filter(weight > 0) %>%
graph_from_data_frame(directed = FALSE)
# grafo inducido por la componente conexa
V(g)$cluster <- components(graph = g)$membership
gcc <- induced_subgraph(graph = g, vids = which(V(g)$cluster == which.max(components(graph = g)$csize)))
par(mfrow = c(1,2), mar = c(1,1,2,1), mgp = c(1,1,1))
# viz 1
set.seed(123)
plot(gcc, layout = layout_with_kk, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label.color = 'black', vertex.label.cex = 0.6, vertex.label.dist = 2)
# viz 2
set.seed(123)
plot(gcc, layout = layout_with_kk, vertex.color = adjustcolor('blue4', 0.1), vertex.frame.color = 'blue4', vertex.size = 2*strength(gcc), vertex.label.color = 'black', vertex.label.cex = 0.9, vertex.label.dist = 1, edge.width = 3*E(g)$weight/max(E(g)$weight))
title(main = "Componente conexa Reggaeton", outer = T, line = -1)
##### componente conexa mas grande de la red
g <- text_Salsa_canciones_bi_counts %>%
filter(weight > 0) %>%
graph_from_data_frame(directed = FALSE)
# grafo inducido por la componente conexa
V(g)$cluster <- components(graph = g)$membership
gcc <- induced_subgraph(graph = g, vids = which(V(g)$cluster == which.max(components(graph = g)$csize)))
par(mfrow = c(1,2), mar = c(1,1,2,1), mgp = c(1,1,1))
# viz 1
set.seed(123)
plot(gcc, layout = layout_with_kk, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label.color = 'black', vertex.label.cex = 0.6, vertex.label.dist = 2)
# viz 2
set.seed(123)
plot(gcc, layout = layout_with_kk, vertex.color = adjustcolor('purple4', 0.1), vertex.frame.color = 'purple4', vertex.size = 2*strength(gcc), vertex.label.color = 'black', vertex.label.cex = 0.9, vertex.label.dist = 1, edge.width = 3*E(g)$weight/max(E(g)$weight))
title(main = "Componente conexa Salsa", outer = T, line = -1)
##### skip-gram: Ejemplo cancion Viento_Caifanes
# texto
text <- c("Préstame tu peine",
"Y péiname el alma",
"Desenrédame",
"Fuera de este mundo",
"Dime que no estoy",
"Soñándote",
"Enséñame",
"De qué estamos hechos",
"Que quiero orbitar planetas",
"Hasta ver uno vació",
"Que quiero irme a vivir",
"Pero que sea contigo",
"Viento",
"Amárranos",
"Tiempo",
"Detente muchos años",
"Viento",
"Amárranos",
"Tiempo",
"Detente muchos años")
# convertir a data frame
text_df <- tibble(line = 1:length(text), text = text)
# tokenizar en skip-grams
text_df %>%
unnest_tokens(tbl = ., input = text, output = skipgram, token = "skip_ngrams", n = 2) %>%
head(n = 10)
## # A tibble: 10 × 2
## line skipgram
## <int> <chr>
## 1 1 préstame
## 2 1 préstame tu
## 3 1 préstame peine
## 4 1 tu
## 5 1 tu peine
## 6 1 peine
## 7 2 y
## 8 2 y péiname
## 9 2 y el
## 10 2 péiname
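The `skip_ngrams` tokenizer with `n = 2` emits, for each position, the unigram, the adjacent bigram, and the bigram that skips one word, which is exactly the pattern in the output above. A base-R sketch (the function `make_skipgrams` is ours, not part of tidytext):

```r
# Unigram + adjacent bigram + skip-j bigrams for every starting position
make_skipgrams <- function(line, k = 1) {
  w <- strsplit(tolower(line), "\\s+")[[1]]
  out <- character(0)
  for (i in seq_along(w)) {
    out <- c(out, w[i])                                   # unigram
    for (j in 0:k) {                                      # gap of j words
      if (i + 1 + j <= length(w)) out <- c(out, paste(w[i], w[i + 1 + j]))
    }
  }
  out
}
make_skipgrams("Préstame tu peine")
# returns c("préstame", "préstame tu", "préstame peine", "tu", "tu peine", "peine")
```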
##### skip-gram: Ejemplo cancion: Tabaco y Chanel
# texto
text <- c("Un olor a tabaco y Chanel",
"Me recuerda el olor de su piel",
"Una mezcla de miel y café",
"Me recuerda el sabor de sus besos",
"El color del final de la noche",
"Me pregunta dónde fui a parar, dónde estás",
"Que esto solo se vive una vez",
"Dónde fuiste a parar, dónde estás",
"Un olor a tabaco y Chanel",
"Y una mezcla de miel y café",
"Me preguntan por ella (ella) Me",
"preguntan por ella",
"Me preguntan también las estrellas",
"Me reclaman que vuelva por ella",
"Ay, que vuelva por ella (ella)",
"Ay, que vuelva por ella)")
# convertir a data frame
text_df <- tibble(line = 1:length(text), text = text)
# tokenizar en skip-grams
text_df %>%
unnest_tokens(tbl = ., input = text, output = skipgram, token = "skip_ngrams", n = 2) %>%
head(n = 10)
## # A tibble: 10 × 2
## line skipgram
## <int> <chr>
## 1 1 un
## 2 1 un olor
## 3 1 un a
## 4 1 olor
## 5 1 olor a
## 6 1 olor tabaco
## 7 1 a
## 8 1 a tabaco
## 9 1 a y
## 10 1 tabaco
##### skip-gram: Ejemplo cancion: Safari
# texto
text <- c("Oye, papi, vamos con mis amigas para el party",
"Tengo algo por un animal",
"Cuando mi gente está aquí, hay tsunami",
"Wavy, así es lo que me gusta",
"You know I like it when tú fresco",
"Me llamo princesa",
"Voy a coger provecho",
"Lo que me gusta",
"You know I like it when tú fresco",
"Me llamo princesa",
"Voy a coger provecho")
# convertir a data frame
text_df <- tibble(line = 1:length(text), text = text)
# tokenizar en skip-grams
text_df %>%
unnest_tokens(tbl = ., input = text, output = skipgram, token = "skip_ngrams", n = 2) %>%
head(n = 10)
## # A tibble: 10 × 2
## line skipgram
## <int> <chr>
## 1 1 oye
## 2 1 oye papi
## 3 1 oye vamos
## 4 1 papi
## 5 1 papi vamos
## 6 1 papi con
## 7 1 vamos
## 8 1 vamos con
## 9 1 vamos mis
## 10 1 con
##### skip-gram: Ejemplo cancion: Salsa-Yuribuenaventura
# texto
text <- c("La salsa que aquí les traigo",
"la traigo directo mira",
"la traigo de las entrañas",
"de mi américa latina",
"El día que estes llorando",
"y tu alma se encuentre triste",
"si bailas salsa mi hermano",
"olvidarás que lo fuiste")
# convertir a data frame
text_df <- tibble(line = 1:length(text), text = text)
# tokenizar en skip-grams
text_df %>%
unnest_tokens(tbl = ., input = text, output = skipgram, token = "skip_ngrams", n = 2) %>%
head(n = 10)
## # A tibble: 10 × 2
## line skipgram
## <int> <chr>
## 1 1 la
## 2 1 la salsa
## 3 1 la que
## 4 1 salsa
## 5 1 salsa que
## 6 1 salsa aquí
## 7 1 que
## 8 1 que aquí
## 9 1 que les
## 10 1 aquí
##### importar datos Rock
text_Rock_canciones <- unlist(c(read_csv("Rock_canciones.txt", col_names = FALSE, show_col_types = FALSE)))
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
names(text_Rock_canciones) <- NULL
text_Rock_canciones <- tibble(line = 1:length(text_Rock_canciones), text = text_Rock_canciones)
##### importar datos Baladas
text_baladas <- unlist(c(read_csv("baladas.txt", col_names = FALSE, show_col_types = FALSE)))
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
names(text_baladas) <- NULL
text_baladas <- tibble(line = 1:length(text_baladas), text = text_baladas)
##### importar datos Reggaeton
text_reggaeton <- unlist(c(read_csv("Reggaeton_proyecto.txt", col_names = FALSE, show_col_types = FALSE)))
names(text_reggaeton) <- NULL
text_reggaeton <- tibble(line = 1:length(text_reggaeton), text = text_reggaeton)
##### importar datos Salsa
text_salsa <- unlist(c(read_csv("Salsa_canciones.txt", col_names = FALSE, show_col_types = FALSE)))
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
names(text_salsa) <- NULL
text_salsa <- tibble(line = 1:length(text_salsa), text = text_salsa)
##### tokenizar en skip-gram
# en este caso cada token es un unigrama o un bigrama regular o un bigrama con espaciamiento
# Rock
text_Rock_canciones %>%
unnest_tokens(tbl = ., input = text, output = skipgram, token = "skip_ngrams", n = 2) %>%
filter(!is.na(skipgram)) -> text_Rock_canciones_skip
dim(text_Rock_canciones_skip)
## [1] 6684 2
head(text_Rock_canciones_skip, n = 10)
## # A tibble: 10 × 2
## line skipgram
## <int> <chr>
## 1 1 los
## 2 1 los de
## 3 1 los adentro
## 4 1 de
## 5 1 de adentro
## 6 1 de nubes
## 7 1 adentro
## 8 1 adentro nubes
## 9 1 adentro negras
## 10 1 nubes
# Baladas
text_baladas %>%
unnest_tokens(tbl = ., input = text, output = skipgram, token = "skip_ngrams", n = 2) %>%
filter(!is.na(skipgram)) -> text_baladas_skip
dim(text_baladas_skip)
## [1] 15342 2
head(text_baladas_skip, n = 10)
## # A tibble: 10 × 2
## line skipgram
## <int> <chr>
## 1 1 quiero
## 2 1 quiero volar
## 3 1 quiero contigo
## 4 1 volar
## 5 1 volar contigo
## 6 1 contigo
## 7 2 muy
## 8 2 muy alto
## 9 2 muy en
## 10 2 alto
# Reggaeton
text_reggaeton %>%
unnest_tokens(tbl = ., input = text, output = skipgram, token = "skip_ngrams", n = 2) %>%
filter(!is.na(skipgram)) -> text_reggaeton_skip
dim(text_reggaeton_skip)
## [1] 11100 2
head(text_reggaeton_skip, n = 10)
## # A tibble: 10 × 2
## line skipgram
## <int> <chr>
## 1 1 el
## 2 1 el chisme
## 3 1 el remix
## 4 1 chisme
## 5 1 chisme remix
## 6 1 remix
## 7 2 ayo
## 8 3 the
## 9 3 the official
## 10 3 the remix
# Salsa
text_salsa %>%
unnest_tokens(tbl = ., input = text, output = skipgram, token = "skip_ngrams", n = 2) %>%
filter(!is.na(skipgram)) -> text_salsa_skip
dim(text_salsa_skip)
## [1] 13573 2
head(text_salsa_skip, n = 10)
## # A tibble: 10 × 2
## line skipgram
## <int> <chr>
## 1 1 a
## 2 1 a joe
## 3 1 a arrollo
## 4 1 joe
## 5 1 joe arrollo
## 6 1 arrollo
## 7 2 canciones
## 8 3 i
## 9 3 i rebelion
## 10 3 rebelion
##### remover unigramas
suppressMessages(suppressWarnings(library(ngram)))
# 1) Rock
# Contar palabras en cada skip-gram
text_Rock_canciones_skip$num_words <- text_Rock_canciones_skip$skipgram %>%
map_int(.f = ~ wordcount(.x))
head(text_Rock_canciones_skip, n = 10)
## # A tibble: 10 × 3
## line skipgram num_words
## <int> <chr> <int>
## 1 1 los 1
## 2 1 los de 2
## 3 1 los adentro 2
## 4 1 de 1
## 5 1 de adentro 2
## 6 1 de nubes 2
## 7 1 adentro 1
## 8 1 adentro nubes 2
## 9 1 adentro negras 2
## 10 1 nubes 1
# remover unigramas
text_Rock_canciones_skip %<>%
filter(num_words == 2) %>%
select(-num_words)
dim(text_Rock_canciones_skip)
## [1] 3840 2
head(text_Rock_canciones_skip, n = 10)
## # A tibble: 10 × 2
## line skipgram
## <int> <chr>
## 1 1 los de
## 2 1 los adentro
## 3 1 de adentro
## 4 1 de nubes
## 5 1 adentro nubes
## 6 1 adentro negras
## 7 1 nubes negras
## 8 2 ti movería
## 9 2 ti cielo
## 10 2 movería cielo
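`ngram::wordcount()` here just counts whitespace-separated tokens, which is what the `num_words == 2` filter relies on. A base-R equivalent (assuming single spaces between tokens, as the tokenizer produces):

```r
# Count tokens per skip-gram, then keep only the two-word ones
num_words <- function(x) lengths(strsplit(x, " ", fixed = TRUE))
sg <- c("los", "los de", "de adentro")
num_words(sg)
# returns c(1L, 2L, 2L)
sg[num_words(sg) == 2]
# returns c("los de", "de adentro")
```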
# 2) Baladas
# Contar palabras en cada skip-gram
text_baladas_skip$num_words <- text_baladas_skip$skipgram %>%
map_int(.f = ~ wordcount(.x))
head(text_baladas_skip, n = 10)
## # A tibble: 10 × 3
## line skipgram num_words
## <int> <chr> <int>
## 1 1 quiero 1
## 2 1 quiero volar 2
## 3 1 quiero contigo 2
## 4 1 volar 1
## 5 1 volar contigo 2
## 6 1 contigo 1
## 7 2 muy 1
## 8 2 muy alto 2
## 9 2 muy en 2
## 10 2 alto 1
# Remover unigramas (solo conservar los skip-grams de 2 palabras)
text_baladas_skip %<>%
filter(num_words == 2) %>%
select(-num_words)
dim(text_baladas_skip)
## [1] 9271 2
head(text_baladas_skip, n = 10)
## # A tibble: 10 × 2
## line skipgram
## <int> <chr>
## 1 1 quiero volar
## 2 1 quiero contigo
## 3 1 volar contigo
## 4 2 muy alto
## 5 2 muy en
## 6 2 alto en
## 7 2 alto algún
## 8 2 en algún
## 9 2 en lugar
## 10 2 algún lugar
# 3) Reggaeton
# Contar palabras en cada skip-gram
text_reggaeton_skip$num_words <- text_reggaeton_skip$skipgram %>%
map_int(.f = ~ wordcount(.x))
head(text_reggaeton_skip, n = 10)
## # A tibble: 10 × 3
## line skipgram num_words
## <int> <chr> <int>
## 1 1 el 1
## 2 1 el chisme 2
## 3 1 el remix 2
## 4 1 chisme 1
## 5 1 chisme remix 2
## 6 1 remix 1
## 7 2 ayo 1
## 8 3 the 1
## 9 3 the official 2
## 10 3 the remix 2
# Remover unigramas (solo conservar los skip-grams de 2 palabras)
text_reggaeton_skip %<>%
filter(num_words == 2) %>%
select(-num_words)
dim(text_reggaeton_skip)
## [1] 6668 2
head(text_reggaeton_skip, n = 10)
## # A tibble: 10 × 2
## line skipgram
## <int> <chr>
## 1 1 el chisme
## 2 1 el remix
## 3 1 chisme remix
## 4 3 the official
## 5 3 the remix
## 6 3 official remix
## 7 3 official baby
## 8 3 remix baby
## 9 4 me duele
## 10 4 me haberte
# 4) Salsa
# Contar palabras en cada skip-gram
text_salsa_skip$num_words <- text_salsa_skip$skipgram %>%
map_int(.f = ~ wordcount(.x))
head(text_salsa_skip, n = 10)
## # A tibble: 10 × 3
## line skipgram num_words
## <int> <chr> <int>
## 1 1 a 1
## 2 1 a joe 2
## 3 1 a arrollo 2
## 4 1 joe 1
## 5 1 joe arrollo 2
## 6 1 arrollo 1
## 7 2 canciones 1
## 8 3 i 1
## 9 3 i rebelion 2
## 10 3 rebelion 1
# Remover unigramas (solo conservar los skip-grams de 2 palabras)
text_salsa_skip %<>%
filter(num_words == 2) %>%
select(-num_words)
dim(text_salsa_skip)
## [1] 8101 2
head(text_salsa_skip, n = 10)
## # A tibble: 10 × 2
## line skipgram
## <int> <chr>
## 1 1 a joe
## 2 1 a arrollo
## 3 1 joe arrollo
## 4 3 i rebelion
## 5 4 quiero contarle
## 6 4 quiero mi
## 7 4 contarle mi
## 8 4 contarle hermano
## 9 4 mi hermano
## 10 5 un pedacito
##### Omitir stop words
##### Rock
text_Rock_canciones_skip %>%
separate(skipgram, c("word1", "word2"), sep = " ") %>%
filter(!grepl(pattern = '[0-9]', x = word1)) %>%
filter(!grepl(pattern = '[0-9]', x = word2)) %>%
filter(!word1 %in% stop_words_es$word) %>%
filter(!word2 %in% stop_words_es$word) %>%
mutate(word1 = chartr(old = names(replacement_list) %>% str_c(collapse = ''),
new = replacement_list %>% str_c(collapse = ''),
x = word1)) %>%
mutate(word2 = chartr(old = names(replacement_list) %>% str_c(collapse = ''),
new = replacement_list %>% str_c(collapse = ''),
x = word2)) %>%
filter(!is.na(word1)) %>%
filter(!is.na(word2)) %>%
count(word1, word2, sort = TRUE) %>%
rename(weight = n) -> text_Rock_canciones_skip_counts
dim(text_Rock_canciones_skip_counts)
## [1] 455 3
head(text_Rock_canciones_skip_counts, n = 10)
## # A tibble: 10 × 3
## word1 word2 weight
## <chr> <chr> <int>
## 1 nubes negras 13
## 2 cabeza solo 11
## 3 fotos tuyas 11
## 4 solo fotos 11
## 5 tuyas llena 11
## 6 florecita rockera 9
## 7 solo adentro 9
## 8 buscaste despertar 8
## 9 despertar pasion 8
## 10 encendiste hoguera 8
##### Baladas
text_baladas_skip %>%
separate(skipgram, c("word1", "word2"), sep = " ") %>%
filter(!grepl(pattern = '[0-9]', x = word1)) %>%
filter(!grepl(pattern = '[0-9]', x = word2)) %>%
filter(!word1 %in% stop_words_es$word) %>%
filter(!word2 %in% stop_words_es$word) %>%
mutate(word1 = chartr(
old = names(replacement_list) %>% str_c(collapse = ''),
new = replacement_list %>% str_c(collapse = ''),
x = word1
)) %>%
mutate(word2 = chartr(
old = names(replacement_list) %>% str_c(collapse = ''),
new = replacement_list %>% str_c(collapse = ''),
x = word2
)) %>%
filter(!is.na(word1)) %>%
filter(!is.na(word2)) %>%
count(word1, word2, sort = TRUE) %>%
rename(weight = n) -> text_baladas_skip_counts
dim(text_baladas_skip_counts)
## [1] 854 3
head(text_baladas_skip_counts, n = 10)
## # A tibble: 10 × 3
## word1 word2 weight
## <chr> <chr> <int>
## 1 solo quiero 16
## 2 atreves volver 12
## 3 besos matan 12
## 4 como atreves 12
## 5 quiero contigo 11
## 6 olvida nada 8
## 7 se se 8
## 8 donde vamos 7
## 9 nadie ve 7
## 10 primer millon 7
##### Reggaeton
text_reggaeton_skip %>%
separate(skipgram, c("word1", "word2"), sep = " ") %>%
filter(!grepl(pattern = '[0-9]', x = word1)) %>%
filter(!grepl(pattern = '[0-9]', x = word2)) %>%
filter(!word1 %in% stop_words_es$word) %>%
filter(!word2 %in% stop_words_es$word) %>%
mutate(word1 = chartr(
old = names(replacement_list) %>% str_c(collapse = ''),
new = replacement_list %>% str_c(collapse = ''),
x = word1
)) %>%
mutate(word2 = chartr(
old = names(replacement_list) %>% str_c(collapse = ''),
new = replacement_list %>% str_c(collapse = ''),
x = word2
)) %>%
filter(!is.na(word1)) %>%
filter(!is.na(word2)) %>%
count(word1, word2, sort = TRUE) %>%
rename(weight = n) -> text_reggaeton_skip_counts
dim(text_reggaeton_skip_counts)
## [1] 741 3
head(text_reggaeton_skip_counts, n = 10)
## # A tibble: 10 × 3
## word1 word2 weight
## <chr> <chr> <int>
## 1 hacerte amor 10
## 2 nena tranquilicese 8
## 3 sigue bailando 8
## 4 vida mia 8
## 5 cuerpo llama 7
## 6 bailando mami 6
## 7 clase rumba 6
## 8 cogi anoche 6
## 9 levanta baby 6
## 10 mami pare 6
##### Salsa
text_salsa_skip %>%
separate(skipgram, c("word1", "word2"), sep = " ") %>%
filter(!grepl(pattern = '[0-9]', x = word1)) %>%
filter(!grepl(pattern = '[0-9]', x = word2)) %>%
filter(!word1 %in% stop_words_es$word) %>%
filter(!word2 %in% stop_words_es$word) %>%
mutate(word1 = chartr(
old = names(replacement_list) %>% str_c(collapse = ''),
new = replacement_list %>% str_c(collapse = ''),
x = word1
)) %>%
mutate(word2 = chartr(
old = names(replacement_list) %>% str_c(collapse = ''),
new = replacement_list %>% str_c(collapse = ''),
x = word2
)) %>%
filter(!is.na(word1)) %>%
filter(!is.na(word2)) %>%
count(word1, word2, sort = TRUE) %>%
rename(weight = n) -> text_salsa_skip_counts
dim(text_salsa_skip_counts)
## [1] 997 3
head(text_salsa_skip_counts, n = 10)
## # A tibble: 10 × 3
## word1 word2 weight
## <chr> <chr> <int>
## 1 quiero mas 18
## 2 barranquilla quedo 13
## 3 escuches canto 13
## 4 negado amor 12
## 5 vida dura 12
## 6 jamas jamas 11
## 7 son son 10
## 8 mas bonita 9
## 9 ae otro 8
## 10 aventura mas 8
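The four cleaning pipelines above (Rock, Baladas, Reggaeton, Salsa) are identical except for the input table, so they can be factored into one helper. This is a sketch, not code from the original document; it assumes `stop_words_es` and `replacement_list` are in scope, as in the chunks above.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Split each skip-gram into its two words, drop tokens with digits or stop
# words, strip accents, and count the surviving (word1, word2) pairs.
count_skipgram_pairs <- function(skip_df) {
  skip_df %>%
    separate(skipgram, c("word1", "word2"), sep = " ") %>%
    filter(!grepl('[0-9]', word1), !grepl('[0-9]', word2)) %>%
    filter(!word1 %in% stop_words_es$word,
           !word2 %in% stop_words_es$word) %>%
    mutate(across(c(word1, word2),
                  ~ chartr(old = str_c(names(replacement_list), collapse = ''),
                           new = str_c(replacement_list, collapse = ''),
                           x = .x))) %>%
    filter(!is.na(word1), !is.na(word2)) %>%
    count(word1, word2, sort = TRUE) %>%
    rename(weight = n)
}

# e.g. text_salsa_skip_counts <- count_skipgram_pairs(text_salsa_skip)
```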
##### Define a network from the bigram frequencies (weight)
##### Rock
g <- text_Rock_canciones_skip_counts %>%
filter(weight > 0) %>%
graph_from_data_frame(directed = FALSE)
g <- igraph::simplify(g) # important: collapse multiple edges and drop loops
# Subgraph induced by the largest connected component
V(g)$cluster <- components(graph = g)$membership
gcc <- induced_subgraph(graph = g, vids = which(V(g)$cluster == which.max(components(graph = g)$csize)))
par(mfrow = c(1,2), mar = c(1,1,2,1), mgp = c(1,1,1))
# viz 1
set.seed(123)
plot(gcc, layout = layout_with_fr, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label = NA)
# viz 2
set.seed(123)
plot(gcc, layout = layout_with_fr, vertex.color = adjustcolor('red4', 0.1), vertex.frame.color = 'red4', vertex.size = 2*strength(gcc), vertex.label = NA)
title(main = "Connected component and clusters - Rock", outer = TRUE, line = -1)
##### Baladas
g <- text_baladas_skip_counts %>%
filter(weight > 0) %>%
graph_from_data_frame(directed = FALSE)
g <- igraph::simplify(g)
# Subgraph induced by the largest connected component
V(g)$cluster <- components(graph = g)$membership
gcc <- induced_subgraph(graph = g, vids = which(V(g)$cluster == which.max(components(graph = g)$csize)))
par(mfrow = c(1,2), mar = c(1,1,2,1), mgp = c(1,1,1))
# Viz 1
set.seed(123)
plot(
gcc,
layout = layout_with_fr,
vertex.color = "skyblue",
vertex.frame.color = "black",
vertex.size = 2,
vertex.label = NA
)
# Viz 2
set.seed(123)
plot(
gcc,
layout = layout_with_fr,
vertex.color = adjustcolor('salmon', 0.1),
vertex.frame.color = 'salmon',
vertex.size = 2*strength(gcc),
vertex.label = NA
)
title(main = "Connected component and clusters - Baladas", outer = TRUE, line = -1)
##### Reggaeton
g <- text_reggaeton_skip_counts %>%
filter(weight > 0) %>%
graph_from_data_frame(directed = FALSE)
g <- igraph::simplify(g)
# Subgraph induced by the largest connected component
V(g)$cluster <- components(graph = g)$membership
gcc <- induced_subgraph(graph = g, vids = which(V(g)$cluster == which.max(components(graph = g)$csize)))
par(mfrow = c(1,2), mar = c(1,1,2,1), mgp = c(1,1,1))
# Viz 1
set.seed(123)
plot(
gcc,
layout = layout_with_fr,
vertex.color = "gold",
vertex.frame.color = "gold",
vertex.size = 2,
vertex.shape = "square",
vertex.label = NA
)
# Viz 2
set.seed(123)
plot(
gcc,
layout = layout_with_fr,
vertex.color = adjustcolor('lightgreen', 0.1),
vertex.frame.color = 'darkgreen',
vertex.size = 2*strength(gcc),
vertex.label = NA
)
title(main = "Connected component and clusters - Reggaeton", outer = TRUE, line = -1)
##### Salsa
g <- text_salsa_skip_counts %>%
filter(weight > 0) %>%
graph_from_data_frame(directed = FALSE)
g <- igraph::simplify(g)
# Subgraph induced by the largest connected component
V(g)$cluster <- components(graph = g)$membership
gcc <- induced_subgraph(graph = g, vids = which(V(g)$cluster == which.max(components(graph = g)$csize)))
par(mfrow = c(1,2), mar = c(1,1,2,1), mgp = c(1,1,1))
# Viz 1
set.seed(123)
plot(
gcc,
layout = layout_with_fr,
vertex.color = "firebrick4",
vertex.frame.color = "firebrick3",
vertex.size = 3,
vertex.shape = "circle", # "pie" requires vertex.pie values, so use plain circles
vertex.label = NA
)
# Viz 2
set.seed(123)
plot(
gcc,
layout = layout_with_fr,
vertex.color = adjustcolor('chocolate4', 0.1),
vertex.frame.color = 'chocolate4',
vertex.size = 2*strength(gcc),
vertex.label = NA
)
title(main = "Connected component and clusters - Salsa", outer = TRUE, line = -1)
# Comparison
Genres of Colombian music. Skip-grams.
Connected component of the network formed with threshold 1.
## Most important words
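The code that produced the rankings below is not shown in this section. A minimal sketch under the assumption that they come from `igraph::eigen_centrality()` applied to each genre's giant component (`gcc` here stands for any one of the four components built above):

```r
library(dplyr)   # also provides tibble()
library(igraph)

# Eigenvector centrality of each word in the giant component,
# sorted so the most central words come first.
ec <- eigen_centrality(gcc)$vector
tibble(word = names(ec), eigen = ec) %>%
  arrange(desc(eigen)) %>%
  head(n = 10)
```

Scores are scaled so the most central word has centrality 1, which matches the leading value in each table below.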
## # A tibble: 10 × 2
## word eigen
## <chr> <dbl>
## 1 quiero 1
## 2 solo 0.735
## 3 contigo 0.617
## 4 perderme 0.394
## 5 conmigo 0.206
## 6 otro 0.162
## 7 poco 0.146
## 8 encontrar 0.146
## 9 volar 0.132
## 10 decirle 0.128
## # A tibble: 10 × 2
## word eigen
## <chr> <dbl>
## 1 mami 1
## 2 bailando 0.844
## 3 sigue 0.844
## 4 vida 0.430
## 5 mia 0.430
## 6 pare 0.399
## 7 poderte 0.0289
## 8 suerte 0.0286
## 9 doy 0.0285
## 10 boom 0.0285
## # A tibble: 10 × 2
## word eigen
## <chr> <dbl>
## 1 solo 1
## 2 fotos 0.576
## 3 quiero 0.546
## 4 entender 0.492
## 5 cabeza 0.437
## 6 adentro 0.424
## 7 tuyas 0.318
## 8 nada 0.173
## 9 existe 0.170
## 10 cuido 0.170
## # A tibble: 10 × 2
## word eigen
## <chr> <dbl>
## 1 mas 1
## 2 quiero 0.842
## 3 bonita 0.352
## 4 jamas 0.324
## 5 aventura 0.303
## 6 ni 0.286
## 7 son 0.116
## 8 hacer 0.107
## 9 arte 0.107
## 10 mostrarte 0.0970
## Baladas Reggaeton Rock_canciones Salsa_canciones
## Partition size 30 24 15 29
## Smallest group size 3 5 5 2
## Largest group size 53 32 26 53
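The clustering code behind this summary is not shown either. A sketch of how such a partition could be computed with igraph community detection (assumption: fast-greedy modularity optimization; the document may have used a different algorithm):

```r
library(igraph)

# Assumed method: fast-greedy modularity clustering on a giant component 'gcc'
cl <- cluster_fast_greedy(gcc)
length(cl)      # number of communities ("partition size")
min(sizes(cl))  # size of the smallest group
max(sizes(cl))  # size of the largest group
```

The per-cluster word tables below then just filter the centrality table by `membership(cl)` for a chosen cluster.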
## # A tibble: 5 × 3
## word cluster eigen
## <chr> <dbl> <dbl>
## 1 quiero 1 1
## 2 solo 1 0.735
## 3 contigo 1 0.617
## 4 perderme 1 0.394
## 5 conmigo 1 0.206
## # A tibble: 5 × 3
## word cluster eigen
## <chr> <dbl> <dbl>
## 1 verdad 5 0.00000668
## 2 siempre 5 0.00000179
## 3 dime 5 0.000000448
## 4 escuchaste 5 0.000000442
## 5 plata 5 0.000000361
## # A tibble: 5 × 3
## word cluster eigen
## <chr> <dbl> <dbl>
## 1 siento 4 0.000000488
## 2 morir 4 0.0000000779
## 3 calor 4 0.0000000221
## 4 gran 4 0.0000000203
## 5 alguna 4 0.0000000195
## # A tibble: 5 × 3
## word cluster eigen
## <chr> <dbl> <dbl>
## 1 pobre 6 0.00888
## 2 dios 6 0.00102
## 3 viajero 6 0.000676
## 4 llegan 6 0.000334
## 5 triste 6 0.0000784